Abstract



We address the problem of competing with 
any large set of N policies in the non- 
stochastic bandit setting, where the learner 
must repeatedly select among K actions but 
observes only the reward of the chosen action. 

We present a modification of the Exp4 algorithm of Auer et al. [2],
called Exp4.P, which with high probability incurs regret at most
$O(\sqrt{KT \ln N})$. Such a bound does not hold for Exp4 due to the
large variance of the importance-weighted estimates used in the
algorithm. The new algorithm is tested empirically on a large-scale,
real-world dataset. For the stochastic version of the problem, we
can use Exp4.P as a subroutine to compete with a possibly infinite
set of policies of VC-dimension d while incurring regret at most
$O(\sqrt{Td \ln T})$ with high probability.

These guarantees improve on those of all pre- 
vious algorithms, whether in a stochastic or 
adversarial environment, and bring us closer 
to providing guarantees for this setting that 
are comparable to those in standard supervised
learning.

1 INTRODUCTION

A learning algorithm is often faced with the problem of
acting given feedback only about the actions that it has 
taken in the past, requiring the algorithm to explore. 
A canonical example is the problem of personalized 
content recommendation on web portals, where the 
goal is to learn which items are of greatest interest 
to a user, given such observable context as the user's 
search queries or geolocation. 

Formally, we consider an online bandit setting where at 
every step, the learner observes some contextual infor- 
mation and must choose one of K actions, each with 
a potentially different reward on every round. After 
the decision is made, the reward of the chosen action 
is revealed. The learner has access to a class of N 
policies, each of which also maps context to actions; 
the learner's performance is measured in terms of its 
regret to this class, defined as the difference between 
the cumulative reward of the best policy in the class 
and the learner's reward. 

This setting goes under different names, including the
"partial-label problem" [11], the "associative bandit
problem" [3], the "contextual bandit problem" [13]
(which is the name we use here), the "k-armed (or
multi-armed) bandit problem with expert advice" [2],
and "associative reinforcement learning". Policies
are sometimes referred to as hypotheses or experts,
and actions are referred to as arms.

If the total number of steps T (usually much larger
than K) is known in advance, and the contexts and
rewards are sampled independently from a fixed but
unknown joint distribution, a simple solution is to
first choose actions uniformly at random for $O(T^{2/3})$
rounds, and from that point on use the policy that
performed best on these rounds. This approach, a variant
of $\epsilon$-greedy sometimes called $\epsilon$-first, can be
shown to have a regret bound of $O(T^{2/3} (K \ln N)^{1/3})$
with high probability [13]. In the full-label setting,
where the entire reward vector is revealed to the
learner at the end of each step, the standard machinery
of supervised learning gives a regret bound of
$O(\sqrt{T \ln N})$ with high probability, using the algorithm
that predicts according to the policy with the currently
lowest empirical error rate.
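
The explore-then-commit idea behind $\epsilon$-first can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the text: the function name `epsilon_first`, the callback signatures, and the use of importance-weighted scores (reward times K whenever a policy agrees with the uniformly drawn arm) to pick the best policy are all choices made here.

```python
import math
import random

def epsilon_first(policies, contexts, reward_fn, num_arms, explore_rounds, rng=None):
    """
    Explore uniformly for `explore_rounds` rounds, then commit to the policy
    with the best importance-weighted reward estimate on the exploration data.
    policies: list of functions context -> arm
    reward_fn(t, arm) -> reward in [0, 1] of the arm actually played on round t
    Returns (index of the committed policy, total reward collected).
    """
    rng = rng or random.Random(0)
    scores = [0.0] * len(policies)
    best, total = 0, 0.0
    for t, x in enumerate(contexts):
        if t < explore_rounds:
            arm = rng.randrange(num_arms)          # uniform exploration
            r = reward_fn(t, arm)
            for i, pi in enumerate(policies):
                if pi(x) == arm:
                    scores[i] += r * num_arms      # importance weight 1 / (1/K)
        else:
            if t == explore_rounds:                # commit once, at the switch
                best = max(range(len(policies)), key=scores.__getitem__)
            arm = policies[best](x)
            r = reward_fn(t, arm)
        total += r
    return best, total

# Toy problem (hypothetical): binary context, reward 1 iff the arm equals the context.
T, K = 1000, 2
ctx_rng = random.Random(1)
contexts = [ctx_rng.randrange(2) for _ in range(T)]
policies = [lambda x: x, lambda x: 0, lambda x: 1, lambda x: 1 - x]
explore = int(T ** (2 / 3) * (K * math.log(len(policies))) ** (1 / 3))
best, total = epsilon_first(policies, contexts,
                            lambda t, arm: float(arm == contexts[t]), K, explore)
print(best, total, explore)
```

The exploration length is set to $T^{2/3}(K \ln N)^{1/3}$, the order suggested by the regret bound quoted above; the constant in front is a free choice in this sketch.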

This paper presents the first algorithm, Exp4.P, that
with high probability achieves $O(\sqrt{TK \ln N})$ regret
in the adversarial contextual bandit setting. This
improves on the $O(T^{2/3} (K \ln N)^{1/3})$ high-probability
bound in the stochastic setting. Previously, this result
was known to hold in expectation for the algorithm
Exp4 [2], but a high-probability statement did
not hold for the same algorithm, as per-round regrets
on the order of $O(T^{-1/4})$ were possible [2]. Succeeding
with high probability is important because reliably
useful methods are preferred in practice.

The Exp4.P analysis addresses competing with a finite
(but possibly exponential in T) set of policies. In the
stochastic case, $\epsilon$-greedy or epoch-greedy style
algorithms [13] can compete with an infinite set of policies
with a finite VC-dimension, but the worst-case regret
grows as $O(T^{2/3})$ rather than $\tilde{O}(T^{1/2})$. We show how
to use Exp4.P in a black-box fashion to guarantee a
high-probability regret bound of $O(\sqrt{Td \ln T})$ in this
case, where d is the VC-dimension. There are simple
examples showing that it is impossible to compete
with a VC-set with an online adaptive adversary, so
some stochastic assumption seems necessary here.

This paper advances a basic argument, namely, that 
such exploration problems are solvable in almost ex- 
actly the same sense as supervised learning problems, 
with suitable modifications to existing learning algo- 
rithms. In particular, we show that learning to com- 
pete with any set of strategies in the contextual ban- 
dit setting requires only a factor of K more experience 
than for supervised learning (to achieve the same level 
of accuracy with the same confidence).

Exp4.P does retain one limitation of its predecessors:
it requires keeping explicit weights over the experts, so
in the case when N is too large, the algorithm becomes
inefficient. On the other hand, Exp4.P provides a
practical framework for incorporating more expressive
expert classes, and it is efficient when N is polynomial
in K and T. It may also be possible to run Exp4.P
efficiently in certain cases when working with a family
of experts that is exponentially large, but well
structured, as in the case of experts corresponding to all
prunings of a decision tree [8]. A concrete example of
this approach is given in Section 7, where an efficient
implementation of Exp4.P is applied to a large-scale
real-world problem.

Related work: The non-contextual K-armed bandit
problem was introduced by Robbins [17], and
analyzed by Lai and Robbins [12] in the i.i.d. case for
fixed reward distributions.

An adversarial version of the bandit problem was
introduced by Auer et al. [2]. They gave an exponential-weight
algorithm called Exp3 with expected cumulative
regret of $\tilde{O}(\sqrt{KT})$ and also Exp3.P with a similar
bound that holds with high probability. They also
showed that these are essentially optimal by proving
a matching lower bound, which holds even in the i.i.d.
case. They were also the first to consider the K-armed
bandit problem with expert advice, introducing the
Exp4 algorithm as discussed earlier. Later, McMahan
and Streeter [16] designed a cleaner algorithm
that improves on their bounds when many irrelevant
actions (that no expert recommends) exist. Further
background on online bandit problems appears in [5].

Exp4.P is based on a careful composition of the Exp4
and Exp3.P algorithms. We distill out the exact
exponential moment method bound used in these results,
proving an inequality for martingales (Theorem 1) to
derive a sharper bound more directly. Our bound
is a Freedman-style inequality for martingales,
and a similar approach was taken in Lemma 2 of
Bartlett et al. [3]. Our bound, however, is more
elemental than that of Bartlett et al., since our Theorem
can be used to prove (and even tighten) their Lemma, but
not vice versa.

With respect to competing with a VC-set, a claim
similar to our Theorem 5 (Section 5) appears in a
work of Lazaric and Munos. Although they
incorrectly claimed that Exp4 can be analyzed to give a
regret bound of $O(KT \ln N)$ with high probability, one
can use Exp4.P in their proof instead. Besides being
correct, our analysis is tighter, which is important in
many situations where such a risk-sensitive algorithm
might be applied.

Related to the bounded VC-dimension setting, Kakade
and Kalai [13] give an $O(T^{3/4})$ regret guarantee for the
transductive online setting, where the learner can
observe the rewards of all actions, not only those it has
taken. Ben-David et al. consider agnostic online
learning for bounded Littlestone dimension. However,
as VC-dimension does not bound Littlestone dimension,
our work provides much tighter bounds in many
cases.

To develop a better intuition about the problem, we
describe several naive strategies and illustrate why 
they fail. These strategies fail even if the rewards of 
each arm are drawn independently from a fixed un- 
known distribution, and thus certainly fail in the ad- 
versarial setting. 

Strategy 1: Use confidence bounds to maintain a set 
of plausible experts, and randomize uniformly over the 
actions predicted by at least one expert in this set. To 
see how this strategy fails, consider two arms, 1 and 
0, with respective deterministic rewards 1 and 0. The 
expert set contains N experts. At every round, one 
of them is chosen uniformly at random to predict arm 
0, and the remaining $N - 1$ predict arm 1. All of the
experts have small regret with high probability. The 
strategy will randomize uniformly over both arms on 
every round, incurring expected regret of nearly T/2. 

Strategy 2: Use confidence bounds to maintain a set
of plausible experts, and follow the prediction of an
expert chosen uniformly at random from this set. To see
how this strategy fails, let the set consist of $N > 2T$
experts predicting in some set of arms, all with reward
0 at every round, and let there be a good expert choosing
another arm, which always has reward 1. The
probability we never choose the good arm is $(1 - 1/N)^T$.

We have

$$-T \log\left(1 - \frac{1}{N}\right) < \frac{T}{N-1} \le \frac{1}{2} < \ln 2,$$

using the elementary inequality $-\log(1 - x) < x/(1 - x)$ for
$x \in (0, 1)$. Thus $(1 - 1/N)^T > 1/2$, and the strategy
incurs regret of T with probability greater than 1/2 (as
it only observes 0 rewards and is unable to eliminate
any of the bad experts).
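
As a quick numerical sanity check of this bound, the probability of never following the good expert can be evaluated directly. The helper name `prob_never_good` and the sample horizons are chosen here purely for illustration; N is set to the smallest integer exceeding 2T.

```python
import math

def prob_never_good(num_experts: int, horizon: int) -> float:
    """Probability that T independent uniform draws over N experts all miss one fixed expert."""
    return (1.0 - 1.0 / num_experts) ** horizon

for horizon in (10, 1_000, 100_000):
    num_experts = 2 * horizon + 1          # smallest integer N with N > 2T
    p = prob_never_good(num_experts, horizon)
    # -T log(1 - 1/N) < T / (N - 1) <= 1/2, hence p > exp(-1/2) > 1/2
    assert p > math.exp(-0.5) > 0.5
    print(horizon, round(p, 4))
```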



2 PROBLEM SETTING AND 
NOTATION 

Let $r(t) \in [0, 1]^K$ be the vector of rewards, where $r_j(t)$
is the reward of arm j on round t. Let $\xi^i(t)$ be the
K-dimensional advice vector of expert i on round t.
This vector represents a probability distribution over
the arms, in which each entry $\xi^i_j(t)$ is the (expert's
recommendation for the) probability of choosing arm
j. For readability, we always use $i \in \{1, \ldots, N\}$ to
index experts and $j \in \{1, \ldots, K\}$ to index arms.

For each policy $\pi$, the associated expert predicts
according to $\pi(x_t)$, where $x_t$ is the context available in
round t. As the context is only used in this fashion
here, we talk about expert predictions as described
above. For a deterministic $\pi$, the corresponding
prediction vector has a 1 in component $\pi(x_t)$ and 0 in the
remaining components.

On each round t, the world commits to $r(t) \in [0, 1]^K$.
Then the N experts make their recommendations
$\xi^1(t), \ldots, \xi^N(t)$, and the learning algorithm A (seeing
the recommendations but not the rewards) chooses
action $j_t \in \{1, \ldots, K\}$. Finally, the world reveals reward
$r_{j_t}(t)$ to the learner, and this game proceeds to the
next round.

We define the return (cumulative reward) of A as
$G_A = \sum_{t=1}^{T} r_{j_t}(t)$. Letting $y_i(t) = \xi^i(t) \cdot r(t)$, we also
define the expected return of expert i,

$$G_i = \sum_{t=1}^{T} y_i(t),$$

and $G_{\max} = \max_i G_i$. The expected regret of
algorithm A is defined as

$$G_{\max} - E[G_A].$$

We can also think about bounds on the regret which
hold with arbitrarily high probability. In that case, we
can say that the regret is bounded by $\epsilon$ with
probability $1 - \delta$, if we have

$$\Pr[G_{\max} - G_A > \epsilon] \le \delta.$$



In the definitions of expected regret and the high prob- 
ability bound, the probabilities and expectations are 
taken w.r.t. both the randomness in the rewards r(t) 
and the algorithm's random choices. 
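
To make the notation concrete, the following sketch computes $G_A$, the expert returns $G_i$, and the realized regret $G_{\max} - G_A$ for one fixed sequence of rewards, advice and actions. The nested-list layout and the name `realized_regret` are assumptions made for this illustration only.

```python
def realized_regret(rewards, advice, actions):
    """
    rewards[t][j] : reward r_j(t) in [0, 1] of arm j on round t
    advice[t][i]  : advice vector xi^i(t) of expert i (a distribution over arms)
    actions[t]    : arm j_t chosen by the algorithm on round t
    Returns (G_A, [G_1, ..., G_N], G_max - G_A).
    """
    num_rounds = len(rewards)
    num_experts = len(advice[0])
    g_alg = sum(r[j] for r, j in zip(rewards, actions))
    g_experts = [
        sum(sum(x * rj for x, rj in zip(advice[t][i], rewards[t]))  # y_i(t)
            for t in range(num_rounds))
        for i in range(num_experts)
    ]
    return g_alg, g_experts, max(g_experts) - g_alg

# Two arms and two deterministic experts: expert 0 always plays arm 0, expert 1 arm 1.
rewards = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
advice = [[[1.0, 0.0], [0.0, 1.0]] for _ in rewards]
g_alg, g_experts, regret = realized_regret(rewards, advice, actions=[1, 0, 0])
print(g_alg, g_experts, regret)
```

Here the algorithm earns reward only on the second round, expert 0 earns reward on the first two rounds, and expert 1 on the last one.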

3 A GENERAL RESULT FOR 
MARTINGALES 

Before proving our main result (Theorem 2), we prove
a general result for martingales in which the variance is
treated as a random variable. It is used in the proof of
Lemma 3 and may also be of independent interest. The
technique is the standard one used to prove Bernstein's
inequality for martingales. The useful difference
here is that we prove the bound for any fixed estimate
of the variance rather than any bound on the variance.

Let $X_1, \ldots, X_T$ be a sequence of real-valued random
variables. Let $E_t[Y] = E[Y \mid X_1, \ldots, X_{t-1}]$.

Theorem 1. Assume, for all t, that $X_t \le R$ and that
$E_t[X_t] = 0$. Define the random variables

$$S = \sum_{t=1}^{T} X_t, \qquad V = \sum_{t=1}^{T} E_t[X_t^2].$$

Then for any $\delta > 0$, with probability at least $1 - \delta$, we
have the following guarantee:

For any $V' \in \left[\frac{R^2 \ln(1/\delta)}{e-2}, \infty\right)$,

$$S \le \sqrt{(e-2)\ln(1/\delta)} \left(\frac{V}{\sqrt{V'}} + \sqrt{V'}\right),$$

and for $V' \in \left[0, \frac{R^2 \ln(1/\delta)}{e-2}\right]$,

$$S \le R \ln(1/\delta) + (e-2)\frac{V}{R}.$$

Note that a simple corollary of this theorem is the more
typical Freedman-style inequality which depends on
an a priori upper bound, which can be substituted for
V and V'.

Proof. For a fixed $\lambda \in [0, 1/R]$, conditioning on
$X_1, \ldots, X_{t-1}$ and computing expectations gives

$$E_t[e^{\lambda X_t}] \le E_t[1 + \lambda X_t + (e-2)\lambda^2 X_t^2] \qquad (1)$$
$$= 1 + (e-2)\lambda^2 E_t[X_t^2] \qquad (2)$$
$$\le \exp\left((e-2)\lambda^2 E_t[X_t^2]\right). \qquad (3)$$

Eq. (1) uses the fact that $e^z \le 1 + z + (e-2)z^2$ for
$z \le 1$. Eq. (2) uses $E_t[X_t] = 0$. Eq. (3) uses $1 + z \le e^z$
for all z.

Let us define random variables $Z_0 = 1$ and, for $t \ge 1$,

$$Z_t = Z_{t-1} \cdot \exp\left(\lambda X_t - (e-2)\lambda^2 E_t[X_t^2]\right).$$

Then,

$$E_t[Z_t] = Z_{t-1} \cdot \exp\left(-(e-2)\lambda^2 E_t[X_t^2]\right) \cdot E_t[e^{\lambda X_t}]$$
$$\le Z_{t-1} \cdot \exp\left(-(e-2)\lambda^2 E_t[X_t^2]\right) \cdot \exp\left((e-2)\lambda^2 E_t[X_t^2]\right) = Z_{t-1}.$$

Therefore, taking expectation over all of the variables
$X_1, \ldots, X_T$ gives

$$E[Z_T] \le E[Z_{T-1}] \le \cdots \le E[Z_0] = 1.$$

By Markov's inequality, $\Pr[Z_T \ge 1/\delta] \le \delta$. Since

$$Z_T = \exp\left(\lambda S - (e-2)\lambda^2 V\right),$$

we can substitute $\lambda = \min\left\{\frac{1}{R}, \sqrt{\frac{\ln(1/\delta)}{(e-2)V'}}\right\}$ and apply
algebra to prove the theorem. □
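
The closing algebra can be spelled out as follows. This is a sketch of one way to finish the argument, using only the quantities already defined in the theorem and the two cases for the choice of $\lambda$.

```latex
% On the event Z_T < 1/\delta, which has probability at least 1 - \delta:
\lambda S - (e-2)\lambda^2 V \le \ln(1/\delta)
\quad\Longrightarrow\quad
S \le \frac{\ln(1/\delta)}{\lambda} + (e-2)\lambda V .

% Case V' \ge R^2 \ln(1/\delta)/(e-2):
% then \lambda = \sqrt{\ln(1/\delta)/((e-2)V')} \le 1/R, and
S \le \sqrt{(e-2)\ln(1/\delta)\,V'} + \sqrt{(e-2)\ln(1/\delta)}\,\frac{V}{\sqrt{V'}}
  = \sqrt{(e-2)\ln(1/\delta)}\left(\frac{V}{\sqrt{V'}} + \sqrt{V'}\right).

% Case V' \le R^2 \ln(1/\delta)/(e-2):
% then \lambda = 1/R, and
S \le R\ln(1/\delta) + (e-2)\frac{V}{R}.
```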

4 A HIGH PROBABILITY 
ALGORITHM 

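The Exp4.P algorithm is given in Algorithm 1. It
comes with the following guarantee.

Theorem 2. Assume that $\ln(N/\delta) \le KT$, and that
the set of experts includes one which, on each round,
selects an action uniformly at random. Then, with
probability at least $1 - \delta$,

$$G_{Exp4.P} \ge G_{\max} - 6\sqrt{KT \ln(N/\delta)}.$$

Algorithm 1 Exp4.P

parameters: $\delta > 0$, $p_{\min} \in [0, 1/K]$
(we set $p_{\min} = \sqrt{\frac{\ln N}{KT}}$)

initialization: Set $w_i(1) = 1$ for $i = 1, \ldots, N$.

for each $t = 1, 2, \ldots$

1. get advice vectors $\xi^1(t), \ldots, \xi^N(t)$

2. set $W_t = \sum_{i=1}^{N} w_i(t)$ and for $j = 1, \ldots, K$ set

   $$p_j(t) = (1 - K p_{\min}) \sum_{i=1}^{N} \frac{w_i(t)\,\xi^i_j(t)}{W_t} + p_{\min}$$

3. draw action $j_t$ randomly according to the
   probabilities $p_1(t), \ldots, p_K(t)$.

4. receive reward $r_{j_t}(t) \in [0, 1]$.

5. for $j = 1, \ldots, K$ set

   $$\hat{r}_j(t) = \begin{cases} r_j(t)/p_j(t) & \text{if } j = j_t \\ 0 & \text{otherwise} \end{cases}$$

6. for $i = 1, \ldots, N$ set

   $$\hat{y}_i(t) = \xi^i(t) \cdot \hat{r}(t)$$
   $$\hat{v}_i(t) = \sum_{j} \xi^i_j(t)/p_j(t)$$
   $$w_i(t+1) = w_i(t) \exp\left(\frac{p_{\min}}{2}\left(\hat{y}_i(t) + \hat{v}_i(t)\sqrt{\frac{\ln(N/\delta)}{KT}}\right)\right)$$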
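
The steps of Algorithm 1 translate almost line by line into code. The sketch below is a minimal illustration rather than a reference implementation. The function name `exp4p`, the callback signatures, the fixed random seed, the clamping of $p_{\min}$ to at most 1/K, and the use of log-weights to avoid overflow are all assumptions made here; it also assumes $N \ge 2$ so that $p_{\min} > 0$.

```python
import math
import random

def exp4p(advice_fn, reward_fn, num_experts, num_arms, horizon, delta, rng=None):
    """
    Run Exp4.P for `horizon` rounds.
    advice_fn(t) -> list of N advice vectors (each a length-K distribution)
    reward_fn(t, j) -> reward in [0, 1] of the arm j that was played
    Returns (list of chosen arms, total reward collected).
    """
    rng = rng or random.Random(0)
    N, K, T = num_experts, num_arms, horizon
    p_min = min(math.sqrt(math.log(N) / (K * T)), 1.0 / K)  # clamp is an assumption
    bonus = math.sqrt(math.log(N / delta) / (K * T))
    log_w = [0.0] * N                    # log w_i(1) = 0, i.e. w_i(1) = 1
    chosen, total = [], 0.0
    for t in range(1, T + 1):
        xi = advice_fn(t)                                        # step 1
        m = max(log_w)
        w = [math.exp(lw - m) for lw in log_w]                   # rescaled weights
        W = sum(w)
        p = [(1 - K * p_min) * sum(w[i] * xi[i][j] for i in range(N)) / W + p_min
             for j in range(K)]                                  # step 2
        j_t = rng.choices(range(K), weights=p)[0]                # step 3
        r = reward_fn(t, j_t)                                    # step 4
        r_hat = [0.0] * K
        r_hat[j_t] = r / p[j_t]                                  # step 5
        for i in range(N):                                       # step 6
            y_hat = sum(xi[i][j] * r_hat[j] for j in range(K))
            v_hat = sum(xi[i][j] / p[j] for j in range(K))
            log_w[i] += (p_min / 2) * (y_hat + v_hat * bonus)
        chosen.append(j_t)
        total += r
    return chosen, total

# Toy run (hypothetical): arm 1 always pays 1, arm 0 pays 0; the third expert is uniform.
experts = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
arms, total = exp4p(lambda t: experts, lambda t, j: float(j == 1),
                    num_experts=3, num_arms=2, horizon=2000, delta=0.05)
bound = 6 * math.sqrt(2 * 2000 * math.log(3 / 0.05))
print(total, 2000 - bound)
```

Rescaling all weights by a common factor leaves the probabilities in Step 2 unchanged, which is why the sketch can work with shifted log-weights. In the toy run the best expert earns 2000, so Theorem 2 predicts that `total` falls below `2000 - bound` with probability at most 0.05.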



The proof of this theorem relies on two lemmas. The 
first lemma gives an upper confidence bound on the ex- 
pected reward of an expert given the estimated reward 
of that expert. 

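The estimated reward of an expert is defined as

$$\hat{G}_i = \sum_{t=1}^{T} \hat{y}_i(t).$$

We also define

$$\hat{\sigma}_i = \sqrt{KT} + \frac{1}{\sqrt{KT}} \sum_{t=1}^{T} \hat{v}_i(t).$$

Lemma 3. Under the conditions of Theorem 2,

$$\Pr\left[\exists i : G_i \ge \hat{G}_i + \sqrt{\ln(N/\delta)}\,\hat{\sigma}_i\right] \le \delta.$$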



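Proof. Fix i. Recalling that $y_i(t) = \xi^i(t) \cdot r(t)$ and the
definition of $\hat{y}_i(t)$ in Algorithm 1, let us further define
the random variables $X_t = y_i(t) - \hat{y}_i(t)$ to which we
will apply Theorem 1. Then $E_t[\hat{y}_i(t)] = y_i(t)$ so that
$E_t[X_t] = 0$ and $X_t \le 1$. Further, we can compute

$$E_t[X_t^2] = E_t\left[(y_i(t) - \hat{y}_i(t))^2\right] \le E_t\left[\hat{y}_i(t)^2\right] = E_t\left[(\xi^i(t) \cdot \hat{r}(t))^2\right]$$
$$= \sum_{j} p_j(t) \left(\frac{\xi^i_j(t)\, r_j(t)}{p_j(t)}\right)^2 \le \sum_{j} \frac{\xi^i_j(t)}{p_j(t)} = \hat{v}_i(t).$$

Note that

$$G_i - \hat{G}_i = \sum_{t=1}^{T} X_t.$$

Using $\delta/N$ instead of $\delta$, and setting $V' = KT$ in
Theorem 1 gives us

$$\Pr\left[G_i - \hat{G}_i \ge \sqrt{(e-2)\ln(N/\delta)} \left(\frac{\sum_{t=1}^{T} \hat{v}_i(t)}{\sqrt{KT}} + \sqrt{KT}\right)\right] \le \delta/N.$$

Noting that $e - 2 < 1$, and applying a union bound over
the N experts gives the statement of the lemma. □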

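To state the next lemma, define

$$\hat{U} = \max_i \left(\hat{G}_i + \hat{\sigma}_i \cdot \sqrt{\ln(N/\delta)}\right).$$

Lemma 4. Under the conditions of Theorem 2,

$$G_{Exp4.P} \ge \left(1 - 2\sqrt{\frac{K \ln N}{T}}\right) \hat{U} - 2\sqrt{KT \ln(N/\delta)} - \sqrt{KT \ln N} - \ln(N/\delta).$$

We can now prove Theorem 2.

Proof. Taking the statement of Lemma 4 and applying
the result of Lemma 3, we get, with probability at
least $1 - \delta$,

$$G_{Exp4.P} \ge G_{\max} - 2\sqrt{\frac{K \ln N}{T}}\, T - \ln(N/\delta) - \sqrt{KT \ln N} - 2\sqrt{KT \ln(N/\delta)} \qquad (4)$$
$$\ge G_{\max} - 6\sqrt{KT \ln(N/\delta)},$$

with Eq. (4) using $G_{\max} \le T$. □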



5 COMPETING WITH SETS OF 
FINITE VC DIMENSION 

A standard VC-argument in the online setting can be
used to apply Exp4.P to compete with an infinite set
of policies $\Pi$ with a finite VC dimension d, when the
data is drawn independently from a fixed, unknown
distribution. For simplicity, this section assumes that
there are only two actions (K = 2), as that is standard
for the definition of VC-dimension.

The algorithm VE chooses an action uniformly at
random for the first $\tau = \sqrt{T\left(2d \ln\frac{eT}{d} + \ln\frac{2}{\delta}\right)}$ rounds.
This step partitions $\Pi$ into equivalence classes
according to the sequence of advice on the first $\tau$ rounds. The
algorithm constructs a finite set of policies $\Pi'$ by
taking one (arbitrary) policy from each equivalence class,
and runs Exp4.P for the remaining $T - \tau$ steps using
$\Pi'$ as its set of experts.
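
The construction of $\Pi'$ amounts to grouping policies by the actions they recommend on the exploration contexts. The sketch below illustrates this for a finite stand-in for $\Pi$; the function name `representatives`, the threshold policy class, and the sample contexts are hypothetical choices made only for this example.

```python
def representatives(policies, contexts):
    """
    Group policies by their action sequence on `contexts` and keep one
    (arbitrary: the first encountered) policy per equivalence class.
    """
    seen = {}
    for policy in policies:
        signature = tuple(policy(x) for x in contexts)
        seen.setdefault(signature, policy)
    return list(seen.values())

def threshold(theta):
    """Policy on the real line (VC-dimension 1): play arm 1 iff x >= theta."""
    return lambda x: int(x >= theta)

policies = [threshold(k / 100) for k in range(101)]   # stand-in for a large class
first_contexts = [0.15, 0.40, 0.42, 0.90]            # contexts seen while exploring
reps = representatives(policies, first_contexts)
print(len(policies), "->", len(reps))
```

With four exploration contexts the threshold class induces only five distinct advice sequences, so Exp4.P would be run on five experts instead of 101, in line with the bound from Sauer's lemma used in the proof.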

For a set of policies $\Pi$, define $G_{\max}(\Pi)$ as the return of
the best policy in $\Pi$ at time horizon T.

Theorem 5. For all distributions D over contexts and
rewards, for all sets of policies $\Pi$ with VC dimension
d, with probability $1 - \delta$,

$$G_{VE} \ge G_{\max}(\Pi) - 9\sqrt{2T\left(d \ln\frac{eT}{d} + \ln\frac{2}{\delta}\right)}.$$



Proof. The regret of the initial exploration is bounded
by $\tau$. We first bound the regret of Exp4.P to $\Pi'$, and
the regret of $\Pi'$ to $\Pi$. We then optimize with respect
to $\tau$ to get the result.

Sauer's lemma implies that $|\Pi'| \le \left(\frac{e\tau}{d}\right)^d$ and hence
with probability $1 - \delta/2$, we can bound
$G_{Exp4.P}(\Pi', T - \tau)$ from below by

$$G_{\max}(\Pi') - 6\sqrt{2(T - \tau)\left(d \ln(e\tau/d) + \ln(2/\delta)\right)}.$$

To bound the regret of $\Pi'$ to $\Pi$, pick any sequence of
feature observations $x_1, \ldots, x_T$. Sauer's lemma implies
the number of unique functions on the observation
sequence in $\Pi$ is bounded by $\left(\frac{eT}{d}\right)^d$.

For a uniformly random subset S of size $\tau$ of the
feature observations we bound the probability that two
functions $\pi, \pi'$ agree on the subset. Let $n = n(\pi, \pi')$
be the number of disagreements on the T-length
sequence. Then

$$\Pr_S\left[\forall x \in S\ \ \pi(x) = \pi'(x)\right] = \left(1 - \frac{n}{T}\right)^{\tau} \le e^{-n\tau/T}.$$

Thus for all $\pi, \pi' \in \Pi$ with $n(\pi, \pi') \ge \frac{T}{\tau} \ln(1/\delta_0)$, we
have $\Pr_S[\forall x \in S\ \ \pi(x) = \pi'(x)] \le \delta_0$.
Setting $\delta_0 = \frac{\delta}{2}\left(\frac{d}{eT}\right)^{2d}$ and using a union bound over
every pair of policies, we get

$$\Pr_S\left(\exists \pi, \pi' \text{ s.t. } n(\pi, \pi') \ge \frac{T}{\tau}\left(2d \ln\frac{eT}{d} + \ln\frac{2}{\delta}\right) \text{ s.t. } \forall x \in S\ \ \pi(x) = \pi'(x)\right) \le \delta/2.$$

In other words, for all sequences $x_1, \ldots, x_T$, with
probability $1 - \delta/2$ over a random subset of size $\tau$,

$$G_{\max}(\Pi') \ge G_{\max}(\Pi) - \frac{T}{\tau}\left(2d \ln\frac{eT}{d} + \ln\frac{2}{\delta}\right).$$

Because the above holds for any sequence $x_1, \ldots, x_T$, it
holds in expectation over sequences drawn i.i.d. from
D. Furthermore, we can regard the first $\tau$ samples as
the random draw of the subset since i.i.d. distributions
are exchangeable.

Consequently, with probability $1 - \delta$, we have

$$G_{VE} \ge G_{\max}(\Pi) - \tau - \frac{T}{\tau}\left(2d \ln\frac{eT}{d} + \ln\frac{2}{\delta}\right) - 6\sqrt{2T\left(d \ln(eT/d) + \ln(2/\delta)\right)}.$$

Letting $\tau = \sqrt{T\left(2d \ln\frac{eT}{d} + \ln\frac{2}{\delta}\right)}$ and substituting $T \ge \tau$
we get

$$G_{VE} \ge G_{\max}(\Pi) - 9\sqrt{2T\left(d \ln\frac{eT}{d} + \ln\frac{2}{\delta}\right)}.$$

□



This theorem easily extends to more than two actions 
(K > 2) given generalizations of the VC-dimension to 
multiclass classification and of Sauer's lemma [7]. 

6 A PRACTICAL IMPROVEMENT 
TO EXP4.P 

Here we give a variant of Step 2 of Algorithm 1 for
setting the probabilities $p_j(t)$, in the style of [16]. For
our analysis of Exp4.P, the two properties we need to
ensure in setting the probabilities $p_j(t)$ are

1. $p_j(t) \approx \sum_{i=1}^{N} w_i(t)\,\xi^i_j(t) / W_t$.

2. The value of each $p_j(t)$ is at least $p_{\min}$.

One way to achieve this, as is done in Algorithm 1,
is to mix in the uniform distribution over all arms.
While this yields a simpler algorithm and achieves
optimal regret up to a multiplicative constant, in general,
this technique can add unnecessary probability mass to
badly-performing arms; for example it can double the
probability of arms whose probability would already
be set to $p_{\min}$.



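Algorithm 2 An Alternate Method for Setting
Probabilities in Step 2 of Algorithm 1

parameters: $w_1(t), w_2(t), \ldots, w_N(t)$ and
$\xi^1(t), \ldots, \xi^N(t)$ and $p_{\min}$

set $W_t = \sum_{i=1}^{N} w_i(t)$

for j = 1 to K set

   $$p_j = \sum_{i=1}^{N} \frac{w_i(t)\,\xi^i_j(t)}{W_t}$$

let $\Delta := 0$ and $l := 1$

for each action j in increasing order according to $p_j$

1. if $p_j (1 - \Delta/l) \ge p_{\min}$
      for all actions $j'$ with $p_{j'} \ge p_j$
         $p'_{j'} = p_{j'} (1 - \Delta/l)$
      return $p'_j$ for all j

2. else $p'_j := p_{\min}$, $\Delta := \Delta + p'_j - p_j$, $l := l - p_j$.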



A fix to this, first suggested by [16], is to ensure the two
requirements via a different mechanism. We present
a variant of their suggestion in Algorithm 2, which
can be used to make Exp4.P perform better in
practice with a computational complexity of $O(K \ln K)$ for
computing the probabilities $p_j(t)$ per round. The
basic intuition of this algorithm is that it enforces the
minimum probability in order from smallest to largest
action probability, while otherwise minimizing the
ratio of the initial to final action probability.

This technique ensures our needed properties, and it is
easy to verify that by setting probabilities using
Algorithm 2 the proof in Section 4 remains valid with little
modification. We use this variant in the experiments
in Section 7.
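
A direct transcription of Algorithm 2 might look as follows. This is a sketch under assumptions: the function name `set_probabilities` and the list-based inputs are chosen here, ties between equal probabilities are broken by sort order, and the input is assumed to satisfy $K p_{\min} \le 1$.

```python
def set_probabilities(weights, advice, p_min):
    """
    weights: list of N nonnegative expert weights w_i(t)
    advice:  list of N advice vectors xi^i(t), each a distribution over K arms
    p_min:   minimum probability per arm, with K * p_min <= 1
    Returns arm probabilities that sum to 1 and are each at least p_min.
    """
    K = len(advice[0])
    W = sum(weights)
    p = [sum(w * xi[j] for w, xi in zip(weights, advice)) / W for j in range(K)]
    order = sorted(range(K), key=lambda j: p[j])     # increasing p_j, O(K log K)
    out = list(p)
    delta, mass = 0.0, 1.0   # Delta: mass added so far; l: mass of arms not yet raised
    for pos, j in enumerate(order):
        scale = 1 - delta / mass
        if p[j] * scale >= p_min:
            for jp in order[pos:]:                   # shrink every remaining arm
                out[jp] = p[jp] * scale
            return out
        out[j] = p_min                               # raise this arm to the floor
        delta += p_min - p[j]
        mass -= p[j]
    return out

# One expert whose advice puts little mass on arm 2 (hypothetical numbers).
probs = set_probabilities([1.0], [[0.90, 0.08, 0.02]], p_min=0.05)
uniform_mix = [(1 - 3 * 0.05) * q + 0.05 for q in (0.90, 0.08, 0.02)]
print(probs, uniform_mix)
```

In this example only the weakest arm is raised to $p_{\min}$ and the other two are shrunk by a common factor, whereas mixing in the uniform distribution as in Step 2 of Algorithm 1 would move more mass away from the best arm.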

7 EXPERIMENTS 

In this section, we apply Exp4.P with the improvement
in Section 6 to a large-scale contextual bandit
problem. The purpose of the experiments is two-fold:
it gives a proof-of-concept demonstration of the
performance of Exp4.P in a non-trivial problem, and also
illustrates how the algorithm may be implemented
efficiently for special classes of experts.

The problem we study is personalized news article
recommendation on the Yahoo! front page [15]. Each
time a user visits the front page, a news article out of
a small pool of hand-picked candidates is highlighted.
The goal is to highlight the most interesting articles to
users, or formally, maximize the total number of user
clicks on the recommended articles. In this problem,
we treat articles as arms, and define the payoff to be 1
if the article is clicked on and 0 otherwise. Therefore,
the average per-trial payoff of an algorithm/policy is
the overall click-through rate (or CTR for short).



Following [15], we created B = 5 user clusters and
thus each user, based on normalized Euclidean
distance to the cluster centers, was associated with
a B-dimensional membership feature d whose
(non-negative) components always sum up to 1. Experts
are designed as follows. Each expert is associated with
a mapping from user clusters to articles, that is, with
a vector $a \in \{1, \ldots, K\}^B$ where $a_b$ is the article to be
displayed for users from cluster $b \in \{1, \ldots, B\}$. When
a user arrives with feature d, the prediction $\xi^a$ of
expert a is $\xi^a_j = \sum_{b : a_b = j} d_b$. There are a total of $K^B$
experts.

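Now we show how to implement Exp4.P efficiently.
Referring to the notation in Exp4.P, we have

$$\hat{y}_a(t) = \xi^a(t) \cdot \hat{r}(t) = \sum_j \sum_{b : a_b = j} d_b(t)\,\hat{r}_j(t) = \sum_b d_b(t)\,\hat{r}_{a_b}(t),$$

$$\hat{v}_a(t) = \sum_j \frac{\xi^a_j(t)}{p_j(t)} = \sum_j \sum_{b : a_b = j} \frac{d_b(t)}{p_j(t)} = \sum_b \frac{d_b(t)}{p_{a_b}(t)}.$$

Thus,

$$w_a(t+1) = w_a(t) \exp\left(\sum_b d_b(t)\, f_{a_b}(t)\right),$$

where

$$f_j(t) = \frac{p_{\min}}{2}\left(\hat{r}_j(t) + \frac{1}{p_j(t)}\sqrt{\frac{\ln(N/\delta)}{KT}}\right).$$

Unraveling the recurrence, we rewrite $w_a(t+1)$ by

$$w_a(t+1) = \exp\left(\sum_{\tau=1}^{t} \sum_b d_b(\tau)\, f_{a_b}(\tau)\right) = \prod_b \exp\left(\sum_{\tau=1}^{t} d_b(\tau)\, f_{a_b}(\tau)\right),$$

implying that $w_a(t+1)$ can be computed implicitly by
maintaining the quantity
$g_{b,j}(t) = \exp\left(\sum_{\tau=1}^{t} d_b(\tau)\, f_j(\tau)\right)$ for each b and j,
with $g_{b,j}(0) = 1$, so that $w_a(t) = \prod_b g_{b,a_b}(t-1)$. Next,
we compute $W_t$ as follows:
$W_t = \sum_a w_a(t) = \sum_a \prod_b g_{b,a_b}(t-1) = \prod_b \left(\sum_j g_{b,j}(t-1)\right)$.
Repeating the same trick, we have

$$\sum_a \frac{w_a(t)\,\xi^a_j(t)}{W_t} = \sum_b \frac{d_b(t)\, g_{b,j}(t-1)}{\sum_{j'} g_{b,j'}(t-1)},$$

which are the inputs to Algorithm 2 to produce the
final arm-selection probabilities, $p_j(t)$ for all j.
Therefore, for this structured set of experts, the time
complexity of Exp4.P is only linear in K and B despite
the exponentially large size of this set.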
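
The factorization can be checked numerically on a small instance by comparing it with a brute-force sum over all $K^B$ experts. In the sketch below the function names and the numbers are made up for illustration; `g[b][j]` plays the role of the per-cluster factor $g_{b,j}$ and `d[b]` the membership feature of the arriving user.

```python
import itertools
import math

def arm_mixture_factored(g, d):
    """
    g[b][j]: positive factor for cluster b and article j, so that
             w_a = prod_b g[b][a_b] for the expert a = (a_1, ..., a_B)
    d[b]:    membership weight of the current user in cluster b (sums to 1)
    Returns [sum_a w_a * xi^a_j / W for each article j] in O(B * K) time.
    """
    K = len(g[0])
    row_sums = [sum(row) for row in g]
    return [sum(d[b] * g[b][j] / row_sums[b] for b in range(len(g)))
            for j in range(K)]

def arm_mixture_bruteforce(g, d):
    """Same quantity by enumerating all K**B experts; for checking only."""
    B, K = len(g), len(g[0])
    num = [0.0] * K
    W = 0.0
    for a in itertools.product(range(K), repeat=B):   # a[b] = article for cluster b
        w_a = math.prod(g[b][a[b]] for b in range(B))
        W += w_a
        for b in range(B):
            num[a[b]] += w_a * d[b]                   # xi^a_j = sum over b with a_b = j of d_b
    return [x / W for x in num]

g = [[1.0, 2.0, 0.5], [0.3, 0.3, 4.0]]   # B = 2 clusters, K = 3 articles
d = [0.25, 0.75]
fast = arm_mixture_factored(g, d)
slow = arm_mixture_bruteforce(g, d)
print(fast, slow)
```

The brute-force version touches every expert, so it is only usable for tiny B and K; the factored version never materializes the expert weights.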

To compare algorithms, we collected historical user
visit events with a random policy that chose articles
uniformly at random for a fraction of user visits on
the Yahoo! front page from May 1 to 9, 2009. This
data contains over 41M user visits, a total of 253
articles, and about 21 candidate articles in the pool per
user visit. (The pool of candidate articles changes
over time, requiring corresponding modifications to
Exp4.P.¹) With such random traffic data, we were able
to obtain an unbiased estimate of the CTR (called
eCTR) of a bandit algorithm as if it is run in the
real world [15].

Due to practical concerns when applying a bandit algo- 
rithm, it is common to randomly assign each user visit 
to one of two "buckets": the learning bucket, where the
bandit algorithm is run, and the deployment bucket, 
where the greedy policy (learned by the algorithm in 
the learning bucket) is used to serve users without re- 
ceiving payoff information. Note that since the ban- 
dit algorithm continues to refine its policy based on 
payoff feedback in the learning bucket, its greedy pol- 
icy may change over time. Its eCTR in the deploy- 
ment bucket thus measures how good this greedy pol- 
icy is. And as the deployment bucket is usually much 
larger than the learning bucket, the deployment eCTR 
is deemed a more important metric. Finally, to protect 
business-sensitive information, we only report normal- 
ized eCTRs, which are the actual eCTRs divided by 
the random policy's eCTR. 

Based on estimates of T and K, we ran Exp4.P with
$\delta = 0.01$. The same estimates were used to set $\gamma$ in
Exp4 to minimize the regret bound in Theorem 7.1 of
[2]. Table 1 summarizes eCTRs of all three algorithms
in the two buckets. All differences are significant due
to the large volume of data.

                   Exp4.P    Exp4      ε-greedy
learning CTR       1.0525    1.0988    1.3827
deployment CTR     1.6512    1.5309    1.4290

Table 1: Overall click-through rates (eCTRs) of
various algorithms on the May 1-9 data set.

First, Exp4.P's eCTR is slightly worse than Exp4 in
the learning bucket. This gap is probably due to the
more conservative nature of Exp4.P, as it uses the
additional $\hat{v}_i$ terms to control variance, which in turn
encourages further exploration. In return for the more
extensive exploration, Exp4.P gained the highest
deployment eCTR, implying its greedy policy is superior
to Exp4.

Second, we note a similar comparison to the ε-greedy
variant of Exp4.P. It was the most greedy among the
three algorithms and thus had the highest eCTR in the
learning bucket, but lowest eCTR in the deployment
bucket. This fact also suggests the benefits of using
the somewhat more complicated soft-max exploration
scheme in Exp4.P.

¹ Our modification ensured that a new article's initial
score was the average of all currently available ones'.

A PROOF OF LEMMA 4

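Recall that the estimated reward of expert i is defined
as

$$\hat{G}_i = \sum_{t=1}^{T} \hat{y}_i(t).$$

Also

$$\hat{\sigma}_i = \sqrt{KT} + \frac{1}{\sqrt{KT}} \sum_{t=1}^{T} \hat{v}_i(t)$$

and that

$$\hat{U} = \max_i \left(\hat{G}_i + \hat{\sigma}_i \sqrt{\ln(N/\delta)}\right).$$

Lemma 4. Under the conditions of Theorem 2,

$$G_{Exp4.P} \ge \left(1 - 2\sqrt{\frac{K \ln N}{T}}\right) \hat{U} - 2\sqrt{KT \ln(N/\delta)} - \sqrt{KT \ln N} - \ln(N/\delta).$$

Proof. For the proof, we use $\gamma = \sqrt{\frac{K \ln N}{T}}$.
We have

$$p_j(t) \ge p_{\min} = \sqrt{\frac{\ln N}{KT}} \quad \text{and} \quad \hat{r}_j(t) \le 1/p_{\min},$$

so that

$$\hat{y}_i(t) \le 1/p_{\min} \quad \text{and} \quad \hat{v}_i(t) \le 1/p_{\min}.$$

Thus,

$$\frac{p_{\min}}{2}\left(\hat{y}_i(t) + \sqrt{\frac{\ln(N/\delta)}{KT}}\,\hat{v}_i(t)\right) \le \frac{p_{\min}}{2}\left(\hat{y}_i(t) + \hat{v}_i(t)\right) \le 1.$$

Let $\bar{w}_i(t) = w_i(t)/W_t$. We will need the following
inequality:

Inequality 1. $\sum_{i=1}^{N} \bar{w}_i(t)\,\hat{v}_i(t) \le \frac{K}{1-\gamma}$.

As a corollary, we have

$$\sum_{i=1}^{N} \bar{w}_i(t)\,\hat{v}_i(t)^2 \le \sum_{i=1}^{N} \bar{w}_i(t)\,\hat{v}_i(t)\,\frac{1}{p_{\min}} \le \sqrt{\frac{KT}{\ln N}}\,\frac{K}{1-\gamma}.$$

Also, [2] (on p. 67) prove the following two inequalities
(with a typo). For completeness, the proofs of all three
inequalities are given below this proof.

Inequality 2. $\sum_{i=1}^{N} \bar{w}_i(t)\,\hat{y}_i(t) \le \frac{r_{j_t}(t)}{1-\gamma}$.

Inequality 3. $\sum_{i=1}^{N} \bar{w}_i(t)\,\hat{y}_i(t)^2 \le \frac{\hat{r}_{j_t}(t)}{1-\gamma}$.

Now letting $b = \frac{p_{\min}}{2}$ and $c = \frac{p_{\min}\sqrt{\ln(N/\delta)}}{2\sqrt{KT}}$,
we have

$$\frac{W_{t+1}}{W_t} = \sum_{i=1}^{N} \frac{w_i(t+1)}{W_t} = \sum_{i=1}^{N} \bar{w}_i(t) \exp\left(b\,\hat{y}_i(t) + c\,\hat{v}_i(t)\right)$$
$$\le \sum_{i=1}^{N} \bar{w}_i(t)\left[1 + b\,\hat{y}_i(t) + c\,\hat{v}_i(t)\right] + \sum_{i=1}^{N} \bar{w}_i(t)\left[2b^2\,\hat{y}_i(t)^2 + 2c^2\,\hat{v}_i(t)^2\right] \qquad (5)$$
$$= 1 + b \sum_{i=1}^{N} \bar{w}_i(t)\,\hat{y}_i(t) + c \sum_{i=1}^{N} \bar{w}_i(t)\,\hat{v}_i(t) + 2b^2 \sum_{i=1}^{N} \bar{w}_i(t)\,\hat{y}_i(t)^2 + 2c^2 \sum_{i=1}^{N} \bar{w}_i(t)\,\hat{v}_i(t)^2$$
$$\le 1 + b\,\frac{r_{j_t}(t)}{1-\gamma} + c\,\frac{K}{1-\gamma} + 2b^2\,\frac{\hat{r}_{j_t}(t)}{1-\gamma} + 2c^2 \sqrt{\frac{KT}{\ln N}}\,\frac{K}{1-\gamma}. \qquad (6)$$

Eq. (5) uses $e^a \le 1 + a + (e-2)a^2$ for $a \le 1$,
$(a + b)^2 \le 2a^2 + 2b^2$, and $e - 2 < 1$. Eq. (6) uses
inequalities 1 through 3.

Now take logarithms, use the inequality $\ln(1 + x) \le x$,
sum both sides over t, and we obtain

$$\ln\frac{W_{T+1}}{W_1} \le \frac{b}{1-\gamma} \sum_{t=1}^{T} r_{j_t}(t) + c\,\frac{KT}{1-\gamma} + \frac{2b^2}{1-\gamma} \sum_{t=1}^{T} \hat{r}_{j_t}(t) + 2c^2 \sqrt{\frac{KT}{\ln N}}\,\frac{KT}{1-\gamma}$$
$$\le \frac{b}{1-\gamma}\,G_{Exp4.P} + c\,\frac{KT}{1-\gamma} + \frac{2b^2}{1-\gamma}\,K\hat{U} + 2c^2 \sqrt{\frac{KT}{\ln N}}\,\frac{KT}{1-\gamma}.$$

Here, we used

$$G_{Exp4.P} = \sum_{t=1}^{T} r_{j_t}(t)$$

and

$$\sum_{t=1}^{T} \hat{r}_{j_t}(t) = K \sum_{t=1}^{T} \frac{1}{K} \sum_{j=1}^{K} \hat{r}_j(t) = K\,\hat{G}_{uniform} \le K\hat{U},$$

because we assumed that the set of experts includes
one who always selects each action uniformly at
random.

We also have $\ln(W_1) = \ln(N)$ and

$$\ln(W_{T+1}) \ge \max_i \left(\ln w_i(T+1)\right) = \max_i \left(b\,\hat{G}_i + c \sum_{t=1}^{T} \hat{v}_i(t)\right) = b\,\hat{U} - b\sqrt{KT \ln(N/\delta)}.$$

Combining then gives

$$b\,\hat{U} - b\sqrt{KT \ln(N/\delta)} - \ln N \le \frac{b}{1-\gamma}\,G_{Exp4.P} + c\,\frac{KT}{1-\gamma} + \frac{2b^2}{1-\gamma}\,K\hat{U} + 2c^2 \sqrt{\frac{KT}{\ln N}}\,\frac{KT}{1-\gamma}.$$

Solving for $G_{Exp4.P}$ now gives

$$G_{Exp4.P} \ge (1 - \gamma - 2bK)\,\hat{U} - \frac{1-\gamma}{b} \ln N - (1-\gamma)\sqrt{KT \ln(N/\delta)} - \frac{c}{b}\,KT - 2\,\frac{c^2}{b} \sqrt{\frac{KT}{\ln N}}\,KT$$
$$\ge (1 - \gamma - 2bK)\,\hat{U} - \sqrt{KT \ln(N/\delta)} - \frac{1}{b} \ln N - \frac{c}{b}\,KT - 2\,\frac{c^2}{b} \sqrt{\frac{KT}{\ln N}}\,KT \qquad (7)$$
$$= \left(1 - 2\sqrt{\frac{K \ln N}{T}}\right) \hat{U} - \ln(N/\delta) - 2\sqrt{KT \ln N} - \sqrt{KT \ln(N/\delta)}, \qquad (8)$$

using $\gamma > 0$ in Eq. (7) and plugging in the definition
of $\gamma$, b, c in Eq. (8). □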



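We prove Inequalities 1 through 3 below.
Let $\bar{w}_i(t) = w_i(t)/W_t$. By Step 2 of Algorithm 1 and
$\gamma = K p_{\min}$, we have
$\sum_{i=1}^{N} \bar{w}_i(t)\,\xi^i_j(t) = \frac{p_j(t) - p_{\min}}{1-\gamma} \le \frac{p_j(t)}{1-\gamma}$ for every arm j.

Inequality 1. $\sum_{i=1}^{N} \bar{w}_i(t)\,\hat{v}_i(t) \le \frac{K}{1-\gamma}$.

Proof.

$$\sum_{i=1}^{N} \bar{w}_i(t)\,\hat{v}_i(t) = \sum_{j=1}^{K} \frac{1}{p_j(t)} \sum_{i=1}^{N} \bar{w}_i(t)\,\xi^i_j(t) \le \sum_{j=1}^{K} \frac{1}{1-\gamma} = \frac{K}{1-\gamma}.$$

□

Inequality 2. $\sum_{i=1}^{N} \bar{w}_i(t)\,\hat{y}_i(t) \le \frac{r_{j_t}(t)}{1-\gamma}$.

Proof.

$$\sum_{i=1}^{N} \bar{w}_i(t)\,\hat{y}_i(t) = \hat{r}_{j_t}(t) \sum_{i=1}^{N} \bar{w}_i(t)\,\xi^i_{j_t}(t) \le \frac{p_{j_t}(t)\,\hat{r}_{j_t}(t)}{1-\gamma} = \frac{r_{j_t}(t)}{1-\gamma}.$$

□

Inequality 3. $\sum_{i=1}^{N} \bar{w}_i(t)\,\hat{y}_i(t)^2 \le \frac{\hat{r}_{j_t}(t)}{1-\gamma}$.

Proof.

$$\sum_{i=1}^{N} \bar{w}_i(t)\,\hat{y}_i(t)^2 = \hat{r}_{j_t}(t)^2 \sum_{i=1}^{N} \bar{w}_i(t)\,\xi^i_{j_t}(t)^2 \le \hat{r}_{j_t}(t)^2\,\frac{p_{j_t}(t)}{1-\gamma} = \frac{\hat{r}_{j_t}(t)\,r_{j_t}(t)}{1-\gamma} \le \frac{\hat{r}_{j_t}(t)}{1-\gamma}.$$

□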